textual cue
Towards an Automated Multimodal Approach for Video Summarization: Building a Bridge Between Text, Audio and Facial Cue-Based Summarization
Islam, Md Moinul, Kakouros, Sofoklis, Heikkilä, Janne, Oussalah, Mourad
The increasing volume of video content in educational, professional, and social domains necessitates effective summarization techniques that go beyond traditional unimodal approaches. This paper proposes a behaviour-aware multimodal video summarization framework that integrates textual, audio, and visual cues to generate timestamp-aligned summaries. By extracting prosodic features, textual cues and visual indicators, the framework identifies semantically and emotionally important moments. A key contribution is the identification of bonus words, which are terms emphasized across multiple modalities and used to improve the semantic relevance and expressive clarity of the summaries. The approach is evaluated against pseudo-ground truth (pGT) summaries generated using an LLM-based extractive method. Experimental results demonstrate significant improvements over traditional extractive methods, such as the Edmundson method, in both text and video-based evaluation metrics. Text-based metrics show ROUGE-1 increasing from 0.4769 to 0.7929 and BERTScore from 0.9152 to 0.9536, while in video-based evaluation, our proposed framework improves F1-Score by almost 23%. The findings underscore the potential of multimodal integration in producing comprehensive and behaviourally informed video summaries.
- North America > Canada > Quebec > Montreal (0.05)
- Europe > Finland > Northern Ostrobothnia > Oulu (0.05)
- Europe > Netherlands > North Holland > Amsterdam (0.04)
- (3 more...)
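The ROUGE-1 figures quoted in the abstract above measure unigram overlap between a generated summary and a reference. The following is a minimal sketch of that idea, assuming whitespace tokenization and lowercasing; the function name `rouge1` is hypothetical, and the abstract does not say which ROUGE-1 variant (precision, recall, or F1) or which tooling the authors used.

```python
from collections import Counter


def rouge1(candidate: str, reference: str) -> dict:
    """Unigram-overlap precision, recall and F1 (illustrative sketch only).

    Assumption: tokens are lowercased and split on whitespace. Real ROUGE
    implementations usually add stemming and more careful tokenization.
    """
    cand_counts = Counter(candidate.lower().split())
    ref_counts = Counter(reference.lower().split())

    # Clipped overlap: each unigram counts at most as often as it
    # appears in the reference.
    overlap = sum((cand_counts & ref_counts).values())

    precision = overlap / max(sum(cand_counts.values()), 1)
    recall = overlap / max(sum(ref_counts.values()), 1)
    f1 = 0.0 if overlap == 0 else 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f1": f1}


if __name__ == "__main__":
    print(rouge1("the cat sat", "the cat sat on the mat"))
```

In the usage line, every candidate word appears in the reference, so precision is 1.0, while only half of the reference words are covered, so recall is 0.5.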
Crossmodal Knowledge Distillation with WordNet-Relaxed Text Embeddings for Robust Image Classification
Guo, Chenqi, Rong, Mengshuo, Feng, Qianli, Feng, Rongfan, Ma, Yinglong
Crossmodal knowledge distillation (KD) aims to enhance a unimodal student using a multimodal teacher model. In particular, when the teacher's modalities include the student's, additional complementary information can be exploited to improve knowledge transfer. In supervised image classification, image datasets typically include class labels that represent high-level concepts, suggesting a natural avenue to incorporate textual cues for crossmodal KD. However, these labels rarely capture the deeper semantic structures in real-world visuals and can lead to label leakage if used directly as inputs, ultimately limiting KD performance. To address these issues, we propose a multi-teacher crossmodal KD framework that integrates CLIP image embeddings with learnable WordNet-relaxed text embeddings under a hierarchical loss. By avoiding direct use of exact class names and instead using semantically richer WordNet expansions, we mitigate label leakage and introduce more diverse textual cues. Experiments show that this strategy significantly boosts student performance, whereas noisy or overly precise text embeddings hinder distillation efficiency. Interpretability analyses confirm that WordNet-relaxed prompts encourage heavier reliance on visual features over textual shortcuts, while still effectively incorporating the newly introduced textual cues. Our method achieves state-of-the-art or second-best results on six public datasets, demonstrating its effectiveness in advancing crossmodal KD.
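As a rough illustration of the distillation idea in the abstract above, the sketch below computes a standard temperature-scaled KD loss between one teacher and one student. This is the generic soft-label formulation, not the paper's multi-teacher hierarchical loss, whose details the abstract does not give; the names `softmax`, `kd_loss`, and the default temperature are assumptions.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kd_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on temperature-softened distributions.

    The T^2 factor is the usual rescaling that keeps gradient magnitudes
    comparable across temperatures. Assumption: a single teacher.
    """
    teacher_probs = softmax(teacher_logits, temperature)
    student_probs = softmax(student_logits, temperature)
    kl = sum(
        p * math.log(p / q)
        for p, q in zip(teacher_probs, student_probs)
        if p > 0.0
    )
    return temperature ** 2 * kl


if __name__ == "__main__":
    print(kd_loss([3.0, 2.0, 1.0], [1.0, 2.0, 3.0]))
```

The loss is zero when the student reproduces the teacher's softened distribution and grows as the two distributions diverge.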
Human-Inspired Long-Term Indoor Localization in Human-Oriented Environment
Zimmerman, Nicky, Sodano, Matteo
Inspired by how humans navigate, we can exploit insights from human navigation to improve long-term localization, which enables robots to navigate in the same environment over extended periods, spanning several months or even years. In this work, we summarize our past contributions to robust long-term localization and ... required. In fact, there is a trade-off between accuracy and robustness, and each task requires a different blend of the two. For example, for planning and navigating along the path of hundreds of meters, robustness (i.e., avoiding jumps in the trajectory) is more important, while high accuracy is only required in specific end-points (i.e.
- Europe > Switzerland > Zürich > Zürich (0.04)
- Europe > Sweden (0.04)
- Europe > Germany > North Rhine-Westphalia > Cologne Region > Bonn (0.04)
Beyond Trend and Periodicity: Guiding Time Series Forecasting with Textual Cues
Xu, Zhijian, Bian, Yuxuan, Zhong, Jianyuan, Wen, Xiangyu, Xu, Qiang
This work introduces a novel Text-Guided Time Series Forecasting (TGTSF) task. By integrating textual cues, such as channel descriptions and dynamic news, TGTSF addresses the critical limitations of traditional methods that rely purely on historical data. To support this task, we propose TGForecaster, a robust baseline model that fuses textual cues and time series data using cross-attention mechanisms. We then present four meticulously curated benchmark datasets to validate the proposed framework, ranging from simple periodic data to complex, event-driven fluctuations. Our comprehensive evaluations demonstrate that TGForecaster consistently achieves state-of-the-art performance, highlighting the transformative potential of incorporating textual information into time series forecasting. This work not only pioneers a novel forecasting task but also establishes a new benchmark for future research, driving advancements in multimodal data integration for time series models.
- Information Technology > Modeling & Simulation (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.94)
- Information Technology > Data Science > Data Mining (0.92)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)
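The abstract above says TGForecaster fuses textual cues and time series data with cross-attention. The sketch below shows only the core scaled dot-product step, assuming queries come from time-series segments and keys/values come from text embeddings; it omits the learned projections, multiple heads, and everything else in the actual model, and the function name `cross_attention` is hypothetical.

```python
import math


def cross_attention(queries, keys, values):
    """Scaled dot-product cross-attention on plain Python lists.

    queries: one vector per time-series segment (assumed role)
    keys, values: one vector each per text embedding (assumed role)
    Returns one fused vector per query.
    """
    dim = len(keys[0])
    fused = []
    for query in queries:
        # Similarity of this query to every text key.
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in keys
        ]
        # Stable softmax over the text positions.
        peak = max(scores)
        weights = [math.exp(s - peak) for s in scores]
        norm = sum(weights)
        weights = [w / norm for w in weights]
        # Weighted sum of the text values.
        fused.append([
            sum(w * value[c] for w, value in zip(weights, values))
            for c in range(len(values[0]))
        ])
    return fused


if __name__ == "__main__":
    series_segments = [[1.0, 0.0], [0.0, 1.0]]
    text_keys = [[1.0, 0.0], [0.0, 1.0]]
    text_values = [[10.0, 0.0], [0.0, 10.0]]
    print(cross_attention(series_segments, text_keys, text_values))
```

Each output row is a mixture of the text values, weighted toward the text embedding whose key best matches that time-series segment.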
Reliability Analysis of Psychological Concept Extraction and Classification in User-penned Text
Garg, Muskan, Sathvik, MSVPJ, Chadha, Amrit, Raza, Shaina, Sohn, Sunghwan
The social NLP research community has witnessed a recent surge in the computational advancements of mental health analysis to build responsible AI models for a complex interplay between language use and self-perception. Such responsible AI models aid in quantifying the psychological concepts from user-penned texts on social media. Thinking beyond the low-level (classification) task, we advance the existing binary classification dataset towards a higher-level task of reliability analysis through the lens of explanations, posing it as one of the safety measures. We annotate the LoST dataset to capture nuanced textual cues that suggest the presence of low self-esteem in the posts of Reddit users. We further state that the NLP models developed for determining the presence of low self-esteem focus more on three types of textual cues: (i) Trigger: words that trigger mental disturbance, (ii) LoST indicators: text indicators emphasizing low self-esteem, and (iii) Consequences: words describing the consequences of mental disturbance. We implement existing classifiers to examine the attention mechanism in pre-trained language models (PLMs) for a domain-specific psychology-grounded task. Our findings suggest the need to shift the focus of PLMs from Trigger and Consequences to a more comprehensive explanation, emphasizing LoST indicators while determining low self-esteem in Reddit posts.
- North America > United States > Minnesota > Olmsted County > Rochester (0.04)
- North America > Canada > Ontario (0.04)
- Asia > India > West Bengal > Kharagpur (0.04)
- Asia > India > Karnataka (0.04)
InterPrompt: Interpretable Prompting for Interrelated Interpersonal Risk Factors in Reddit Posts
Sathvik, MSVPJ, Sarkar, Surjodeep, Saxena, Chandni, Sohn, Sunghwan, Garg, Muskan
Mental health professionals and clinicians have observed the upsurge of mental disorders due to Interpersonal Risk Factors (IRFs). To simulate the human-in-the-loop triaging scenario for early detection of mental health disorders, we recognize textual indications to ascertain these IRFs: Thwarted Belongingness (TBe) and Perceived Burdensomeness (PBu) within personal narratives. In light of this, we use N-shot learning with the GPT-3 model on the IRF dataset and underscore the importance of fine-tuning the GPT-3 model to incorporate the context-specific sensitivity and the interconnectedness of textual cues that represent both IRFs. In this paper, we introduce an Interpretable Prompting (InterPrompt) method to boost the attention mechanism by fine-tuning the GPT-3 model. This allows a more sophisticated level of language modification by adjusting the pre-trained weights. Our model learns to detect usual patterns and underlying connections across both the IRFs, which leads to better system-level explainability and trustworthiness. The results of our research demonstrate that all four variants of the GPT-3 model, when fine-tuned with InterPrompt, perform considerably better than the baseline methods, both in terms of classification and explanation generation.
- North America > United States > Minnesota > Olmsted County > Rochester (0.04)
- Asia > China > Hong Kong (0.04)
- North America > United States > New York > New York County > New York City (0.04)
- (2 more...)
Empowering Fake-News Mitigation: Insights from Sharers' Social Media Post-Histories
Schoenmueller, Verena, Blanchard, Simon J., Johar, Gita V.
Misinformation is a global concern and limiting its spread is critical for protecting democracy, public health, and consumers. We propose that consumers' own social media post-histories are an underutilized data source to study what leads them to share links to fake-news. In Study 1, we explore how textual cues extracted from post-histories distinguish fake-news sharers from random social media users and others in the misinformation ecosystem. Among other results, we find across two datasets that fake-news sharers use more words related to anger, religion and power. In Study 2, we show that adding textual cues from post-histories improves the accuracy of models to predict who is likely to share fake-news. In Study 3, we provide a preliminary test of two mitigation strategies deduced from Study 1 - activating religious values and reducing anger - and find that they reduce fake-news sharing and sharing more generally. In Study 4, we combine survey responses with users' verified Twitter post-histories and show that using empowering language in a fact-checking browser extension ad increases download intentions. Our research encourages marketers, misinformation scholars, and practitioners to use post-histories to develop theories and test interventions to reduce the spread of misinformation.
- North America > United States > Illinois > Cook County > Chicago (0.04)
- North America > United States > California > Los Angeles County > Los Angeles (0.04)
- North America > United States > New York > New York County > New York City (0.04)
- (21 more...)
- Research Report > New Finding (1.00)
- Research Report > Experimental Study (1.00)
- Questionnaire & Opinion Survey (1.00)
- Overview (0.92)
- Media > News (1.00)
- Leisure & Entertainment > Sports > Football (1.00)
- Leisure & Entertainment > Sports > Basketball (1.00)
- (5 more...)
AidUI: Toward Automated Recognition of Dark Patterns in User Interfaces
Mansur, SM Hasan, Salma, Sabiha, Awofisayo, Damilola, Moran, Kevin
Past studies have illustrated the prevalence of UI dark patterns, or user interfaces that can lead end-users toward (unknowingly) taking actions that they may not have intended. Such deceptive UI designs can result in adverse effects on end users, such as oversharing personal information or financial loss. While significant research progress has been made toward the development of dark pattern taxonomies, developers and users currently lack guidance to help recognize, avoid, and navigate these often subtle design motifs. However, automated recognition of dark patterns is a challenging task, as the instantiation of a single type of pattern can take many forms, leading to significant variability. In this paper, we take the first step toward understanding the extent to which common UI dark patterns can be automatically recognized in modern software applications. To do this, we introduce AidUI, a novel automated approach that uses computer vision and natural language processing techniques to recognize a set of visual and textual cues in application screenshots that signify the presence of ten unique UI dark patterns, allowing for their detection, classification, and localization. To evaluate our approach, we have constructed ContextDP, the current largest dataset of fully-localized UI dark patterns that spans 175 mobile and 83 web UI screenshots containing 301 dark pattern instances. The results of our evaluation illustrate that AidUI achieves an overall precision of 0.66, recall of 0.67, and F1-score of 0.65 in detecting dark pattern instances, reports few false positives, and is able to localize detected patterns with an IoU score of ~0.84. Furthermore, a significant subset of our studied dark patterns can be detected quite reliably (F1 score of over 0.82), and future research directions may allow for improved detection of additional patterns.
- North America > United States > New York > New York County > New York City (0.05)
- North America > United States > Virginia > Fairfax County > Fairfax (0.04)
- Asia > Singapore (0.04)
- (2 more...)
- Information Technology > Human Computer Interaction > Interfaces (1.00)
- Information Technology > Artificial Intelligence > Natural Language (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.68)
- Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.67)
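The AidUI abstract above reports localization quality as an IoU (intersection over union) score. A minimal sketch of that metric for two axis-aligned boxes follows, assuming boxes are given as `(x1, y1, x2, y2)` with `x1 < x2` and `y1 < y2`; the function name `iou` and the box format are assumptions, not the authors' evaluation code.

```python
def iou(box_a, box_b):
    """Intersection over union of two axis-aligned boxes.

    Assumption: each box is (x1, y1, x2, y2) in pixel coordinates,
    with (x1, y1) the top-left and (x2, y2) the bottom-right corner.
    """
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b

    # Width and height of the overlap region (zero if the boxes are disjoint).
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    intersection = inter_w * inter_h

    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - intersection
    return intersection / union if union > 0 else 0.0


if __name__ == "__main__":
    predicted = (0, 0, 2, 2)
    ground_truth = (1, 1, 3, 3)
    print(iou(predicted, ground_truth))
```

In the usage example the boxes overlap in a 1 x 1 region and jointly cover 7 units of area, so the score is 1/7; identical boxes score 1.0 and disjoint boxes score 0.0.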
Shirani
Type information plays an important role in the success of information retrieval and recommendation systems in software engineering. Thus, the absence of types in dynamically-typed languages poses a challenge to adapt these systems to support dynamic languages. In this paper, we explore the viability of type inference using textual cues. That is, we formulate the type inference problem as a classification problem which uses the textual features in the source code to predict the type of variables. In this approach, a classifier learns a model to distinguish between types of variables in a program. The model is subsequently used to (approximately) infer the types of other variables. We evaluate the feasibility of this approach on four Java projects wherein type information is already available in the source code and can be used to train and test a classifier. Our experiments show this approach can predict the type of new variables with relatively high accuracy (80% F-measure). These results suggest that textual cues can be complementary tools in inferring types for dynamic languages.
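The abstract above frames type inference as classification over textual features of the source code. The sketch below illustrates that framing with a small multinomial naive Bayes model over identifier subtokens. The choice of naive Bayes, the subtoken features, and all names (`subtokens`, `NaiveBayesTypeGuesser`, the toy training pairs) are assumptions for illustration; the abstract does not state which classifier or features the authors used.

```python
import math
import re
from collections import Counter, defaultdict


def subtokens(identifier):
    """Split camelCase or snake_case identifiers into lowercase words."""
    return [p.lower() for p in re.findall(r"[A-Za-z][a-z]*|\d+", identifier)]


class NaiveBayesTypeGuesser:
    """Multinomial naive Bayes with add-one smoothing (illustrative only)."""

    def __init__(self):
        self.type_counts = Counter()
        self.token_counts = defaultdict(Counter)
        self.vocab = set()

    def fit(self, examples):
        # examples: iterable of (variable_name, declared_type) pairs
        for name, type_name in examples:
            self.type_counts[type_name] += 1
            for token in subtokens(name):
                self.token_counts[type_name][token] += 1
                self.vocab.add(token)
        return self

    def predict(self, name):
        total = sum(self.type_counts.values())
        best_type, best_score = None, -math.inf
        for type_name, count in self.type_counts.items():
            score = math.log(count / total)  # log prior
            denom = sum(self.token_counts[type_name].values()) + len(self.vocab)
            for token in subtokens(name):
                score += math.log((self.token_counts[type_name][token] + 1) / denom)
            if score > best_score:
                best_type, best_score = type_name, score
        return best_type


if __name__ == "__main__":
    training = [
        ("itemCount", "int"), ("retryCount", "int"),
        ("userName", "String"), ("fileName", "String"),
        ("isReady", "boolean"), ("isEmpty", "boolean"),
    ]
    model = NaiveBayesTypeGuesser().fit(training)
    print(model.predict("lineCount"), model.predict("hostName"), model.predict("isValid"))
```

As in the paper's setup, labels here come from code where types are already declared (Java-style names in this toy data); the trained model is then applied to variables whose types are unknown.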
Learning to Reason with Relational Video Representation for Question Answering
Le, Thao Minh, Le, Vuong, Venkatesh, Svetha, Tran, Truyen
How does machine learn to reason about the content of a video in answering a question? A Video QA system must simultaneously understand language, represent visual content over space-time, and iteratively transform these representations in response to lingual content in the query, and finally arriving at a sensible answer. While recent advances in textual and visual question answering have come up with sophisticated visual representation and neural reasoning mechanisms, major challenges in Video QA remain on dynamic grounding of concepts, relations and actions to support ...
While acquiring visual knowledge of objects and relations from static images has advanced hugely in recent years [7], deep video understanding remains elusive. Compared to static images, video poses new challenges, primarily due to the inherent dynamic nature of visual content over time [6, 34]. At the lowest level, we have correlated motion and appearance [6]. At a higher level, we have objects that are persistent over time, actions that are local in time, and the relations that can span over an extended length. Thus searching for an answer from a video facilitates solving simultaneous sub-tasks in both the visual and lingual spaces, probably in an iterative and compositional fashion.